TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect
Abstract
Pre-trained models have achieved high performance with the introduction of Transformers such as Bidirectional Encoder Representations from Transformers, known as BERT. Nevertheless, most of these proposed models have been trained on the most represented languages (English, French, German, etc.), and few models target under-represented languages and dialects. This work introduces a feasibility study of pre-training a Transformer-based language model for the Tunisian dialect as an under-represented language. The model is evaluated on a dialect identification task, a sentiment analysis task, and a reading comprehension question-answering task. Results demonstrate that, instead of using datasets from traditional sources (Wikipedia, articles, etc.), noisy web-crawled data is better suited to such a dialect. Additionally, experiments show that a reasonably small-scale dataset leads to results similar to or better than those obtained with large-scale datasets, and that TunBERT reaches or improves on the state of the art in all three downstream tasks. The pre-trained model, named TunBERT, and the datasets used for the fine-tuning step are publicly released.
Keywords: Transformers; Language models; Under-represented languages; TunBERT; BERT
Similar resources
Automatic Speech Recognition for Tunisian Dialect
Speech recognition for under-resourced languages has been an active field of research during the past decade. The Tunisian Arabic dialect has been chosen as a typical example of an under-resourced Arabic dialect. We propose, in this paper, our first steps to build an automatic speech recognition system for Tunisian dialect. Several Acoustic Models have been trained using HMM-GMM and HMM-DNN ...
Morphological Analysis of Tunisian Dialect
In this paper, we address the problem of the morphological analysis of an Arabic dialect. We propose a method to adapt an Arabic morphological analyzer for the Tunisian dialect (TD). In order to do that, we create a lexicon for the TD. The creation of the lexicon is done in two steps. The first step consists in adapting a Modern Standard Arabic (MSA) lexicon. We adapted a list of MSA derivation...
Building Ontologies to Understand Spoken Tunisian Dialect
This paper presents a method to understand spoken Tunisian dialect based on lexical semantics. The method takes into account the specificity of the Tunisian dialect, which has no linguistic processing tools. It is ontology-based, which allows exploiting ontological concepts for semantic annotation and ontological relations for speech interpretation. This combination increases the rat...
Automatic Detection of Transition Zones in Tunisian Dialect
This study is an extension of our previous research on the detection of transition zones based on multiresolution spectral analysis (MRS). In this paper we present the fourth step in the realization of an automatic system for Tunisian Dialect segmentation and analysis. The MRS is calculated over several Fast Fourier Transforms (FFT) of different lengths. It can provide a higher temporal accura...
A Generative Model for Multi-Dialect Representation
In the era of deep learning, several unsupervised models have been developed to capture the key features in unlabeled handwritten data. Popular among them is the Restricted Boltzmann Machine (RBM). However, due to the novelty in handwritten multi-dialect data, the RBM may fail to generate an efficient representation. In this paper we propose a generative model – the Mode Synthesizing Machine (M...
Journal
Journal title: Communications in Computer and Information Science
Year: 2022
ISSN: 1865-0937, 1865-0929
DOI: https://doi.org/10.1007/978-3-031-08277-1_23